Generalized Biwords for Bitext Compression and Translation Spotting: Extended Abstract
Authors
Abstract
The increasing availability of large collections of bilingual parallel corpora has fostered the development of natural-language processing applications that address bilingual tasks, such as corpus-based machine translation, the automatic extraction of bilingual lexicons, and translation spotting [Simard, 2003]. A bilingual parallel corpus, or bitext, is a textual collection that contains pairs of documents which are translations of one another. In the words of Melamed [2001, p. 1], “bitexts are one of the richest sources of linguistic knowledge because the translation of a text into another language can be viewed as a detailed annotation of what that text means”. Large bitexts are usually available in a compressed form in order to reduce storage requirements, to improve access times [Ziviani et al., 2000], and to increase the efficiency of transmission. However, compressing the two texts of a bitext independently is clearly far from efficient because the information contained in both texts is redundant. Previous work [Nevill-Manning and Bell, 1992; Conley and Klein, 2008; Martínez-Prieto et al., 2009; Adiego et al., 2009; 2010] has shown that bitexts can be compressed more efficiently if the fact that the two texts are mutual translations is exploited. Martínez-Prieto et al. [2009], and Adiego and his colleagues [2009; 2010] propose the use of biwords —pairs of words, each one from a different text, with a high probability of co-occurrence— as input units for the compression of bitexts. This means that a biword-based intermediate representation of the bitext is obtained by exploiting alignments, and unaligned words are encoded as pairs in which one component is the empty string. Significant spatial savings are achieved with this technique [Martínez-Prieto et al., 2009], although the compression of biword sequences requires larger dictionaries than traditional text compression methods.
The biword-based compression approach works as a simple processing pipeline consisting of two stages (see Figure 1). After a text alignment has been obtained without pre-existing linguistic resources, the first stage transforms the bitext into a biword sequence. The second stage then compresses this sequence. Decompression works in reverse order: the biword sequence representing the bitext is first recovered, and the two texts of the bitext are then reconstructed from it.
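The two-stage pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy bitext and word alignment are hypothetical, and zlib stands in as a generic byte-oriented compressor for the second stage. Aligned word pairs become biwords, and unaligned words are paired with the empty string, as described in the abstract.

```python
import zlib

# Toy aligned bitext (hypothetical example data, not from the paper).
# The alignment is a list of (source_index, target_index) pairs;
# "big" is deliberately left unaligned.
src = ["the", "big", "white", "house"]
tgt = ["la", "casa", "blanca"]
alignment = [(0, 0), (2, 2), (3, 1)]  # the-la, white-blanca, house-casa

def to_biwords(src, tgt, alignment):
    """Stage 1: transform the bitext into a biword sequence.

    Aligned word pairs become biwords; an unaligned word is encoded
    as a pair whose other component is the empty string.
    """
    aligned_src = {s for s, _ in alignment}
    aligned_tgt = {t for _, t in alignment}
    biwords = [(src[s], tgt[t]) for s, t in alignment]
    biwords += [(src[i], "") for i in range(len(src)) if i not in aligned_src]
    biwords += [("", tgt[j]) for j in range(len(tgt)) if j not in aligned_tgt]
    return biwords

def compress(biwords):
    """Stage 2: serialise the biword sequence and compress the result
    with a generic compressor (zlib here, purely for illustration)."""
    payload = "\n".join(f"{s}\t{t}" for s, t in biwords).encode("utf-8")
    return zlib.compress(payload)

def decompress(blob):
    """Reverse pipeline: decompress first, then recover the biword
    sequence, from which the two texts can be reconstructed."""
    lines = zlib.decompress(blob).decode("utf-8").split("\n")
    return [tuple(line.split("\t")) for line in lines]

biwords = to_biwords(src, tgt, alignment)
assert decompress(compress(biwords)) == biwords  # lossless round trip
```

A real biword compressor would assign codewords to biwords from a shared dictionary rather than serialising them as text, but the sketch shows the pipeline shape: align, transform to biwords, compress, and reverse the steps to decompress.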
Similar resources
Generalized Biwords for Bitext Compression and Translation Spotting
Large bilingual parallel texts (also known as bitexts) are usually stored in a compressed form, and previous work has shown that they can be more efficiently compressed if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords —pairs of parallel words with a high probability of cooccurrence— that can be used as an intermediat...
Harnessing the Redundant Results of Translation Spotting
Translation spotting consists in automatically identifying the translations of a user query inside a bitext. This task, when it relies solely on statistical word alignment algorithms, fails to achieve excellent results. In this paper, we show that identifying the translations of a query during a first translation spotting stage provides relevant information that can be used in a second stage to...
Boosting Bitext Compression
Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. This implies that when modelling bitexts one can take advantage of the fact that there exists a relation between both texts; the text alignment task allows such a relationship to be established. In this paper we propose different approaches that use words and biwords (pairs made of two words, each ...
A Two-Level Structure for Compressing Aligned Bitexts
A bitext, or bilingual parallel corpus, consists of two texts, each one in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they are used as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our...
An Attentional Model for Speech Translation Without Transcription
For many low-resource languages, spoken language resources are more likely to be annotated with translations than transcriptions. This bilingual speech data can be used for word-spotting, spoken document retrieval, and even for documentation of endangered languages. We experiment with the neural, attentional model applied to this data. On phoneto-word alignment and translation reranking tasks, ...
Publication date: 2013